Shortest triplet clustering: reconstructing large phylogenies using representative sets
نویسندگان
چکیده
BACKGROUND Understanding the evolutionary relationships among species based on their genetic information is one of the primary objectives in phylogenetic analysis. Reconstructing phylogenies for large data sets is still a challenging task in Bioinformatics. RESULTS We propose a new distance-based clustering method, the shortest triplet clustering algorithm (STC), to reconstruct phylogenies. The main idea is the introduction of a natural definition of so-called k-representative sets. Based on k-representative sets, shortest triplets are reconstructed and serve as building blocks for the STC algorithm to agglomerate sequences for tree reconstruction in O(n2) time for n sequences. Simulations show that STC gives better topological accuracy than other tested methods that also build a first starting tree. STC appears as a very good method to start the tree reconstruction. However, all tested methods give similar results if balanced nearest neighbor interchange (BNNI) is applied as a post-processing step. BNNI leads to an improvement in all instances. The program is available at http://www.bi.uni-duesseldorf.de/software/stc/. CONCLUSION The results demonstrate that the new approach efficiently reconstructs phylogenies for large data sets. We found that BNNI boosts the topological accuracy of all methods including STC, therefore, one should use BNNI as a post-processing step to get better topological accuracy.
منابع مشابه
Phylogenomic Analysis Using Bayesian Congruence Measuring
Phylogenomic analysis of large sets of molecular characters, primarily DNA and proteins, provides great opportunities to estimate and understand important evolutionary processes. However, molecular phylogenies inferred from individual loci often differ. This incongruence among phylogenies can be the result of systematic error, but can also be the result of different evolutionary histories. We p...
متن کاملAn Incremental DC Algorithm for the Minimum Sum-of-Squares Clustering
Here, an algorithm is presented for solving the minimum sum-of-squares clustering problems using their difference of convex representations. The proposed algorithm is based on an incremental approach and applies the well known DC algorithm at each iteration. The proposed algorithm is tested and compared with other clustering algorithms using large real world data sets.
متن کاملToward an Efficient Method of Identifying Core Genes for Evolutionary and Functional Microbial Phylogenies
Microbial community metagenomes and individual microbial genomes are becoming increasingly accessible by means of high-throughput sequencing. Assessing organismal membership within a community is typically performed using one or a few taxonomic marker genes such as the 16S rDNA, and these same genes are also employed to reconstruct molecular phylogenies. There is thus a growing need to bioinfor...
متن کاملOn the hardness of inferring phylogenies from triplet-dissimilarities
This work considers the problem of reconstructing a phylogenetic tree from triplet dissimilarities, which are dissimilarities defined over taxontriplets. Triplet dissimilarities are possibly the simplest generalization of pairwise dissimilarities, and were used for phylogenetic reconstructions in the past few years. We study the hardness of finding a tree best fitting a given triplet-dissimilar...
متن کاملClustering and Reconstructing Large Data Sets
by Piyush Kumar Doctor of Philosophy in Computer Science Stony Brook University June, 2004
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- BMC Bioinformatics
دوره 6 شماره
صفحات -
تاریخ انتشار 2005